Day 3: PCA analysis and clustering demostration
1. Creating a synthetic Fungal Trait Dataset
Let’s start by generating a simple synthetic dataset of 4 fungal traits to show the analysis process. We’ll imagine 10 hypothetical fungal species (labeled A through J for simplicity). For each species, we have measured:
- HyphalExtensionRate (mm/day) – e.g. how fast the fungus’s mycelium grows.
- EnzymeActivity (arbitrary units) – representing extracellular enzyme production.
- SporeProduction (count per culture or per unit time) – number of spores produced (indicates reproductive output and dispersal potential).
- DroughtTolerance (percentage survival or growth under dry conditions) – a measure of stress tolerance (higher means the fungus can withstand drought stress better).
Create synthetic fungal trait data
(In practice, this might be loaded from a file or measured experimentally, but here we explicitly define it for demonstration.)
# Define species names
species <- c("Species_A","Species_B","Species_C","Species_D",
"Species_E","Species_F","Species_G","Species_H",
"Species_I","Species_J")
# Manually assign trait values for each species:
hyphal_ext <- c(10, 9, 12, 11, # A-D: high extension rates
2, 4, 3, 5, # E-H: low extension rates
6, 7) # I-J: moderate extension
enzyme_activity <- c(2, 4, 3, 2, # A-D: low enzyme activity
9, 8, 7, 9, # E-H: high enzyme activity
4, 5) # I-J: moderate enzyme
spore_prod <- c(900, 1100, 950, 1200, # A-D: high spore production
80, 150, 120, 90, # E-H: low spore production
1000, 800) # I-J: high spore (similar to A-D)
drought_tol <- c(25, 15, 30, 20, # A-D: low drought tolerance (%)
85, 75, 80, 90, # E-H: high drought tolerance
80, 70) # I-J: high drought tolerance (like E-H)
# Combine into a data frame
trait_data <- data.frame(Species = species,
HyphalExtensionRate = hyphal_ext,
EnzymeActivity = enzyme_activity,
SporeProduction = spore_prod,
DroughtTolerance = drought_tol,
stringsAsFactors = FALSE)
# Take a quick look at the dataset
trait_data## Species HyphalExtensionRate EnzymeActivity SporeProduction
## 1 Species_A 10 2 900
## 2 Species_B 9 4 1100
## 3 Species_C 12 3 950
## 4 Species_D 11 2 1200
## 5 Species_E 2 9 80
## 6 Species_F 4 8 150
## 7 Species_G 3 7 120
## 8 Species_H 5 9 90
## 9 Species_I 6 4 1000
## 10 Species_J 7 5 800
## DroughtTolerance
## 1 25
## 2 15
## 3 30
## 4 20
## 5 85
## 6 75
## 7 80
## 8 90
## 9 80
## 10 70
Exploratory Data Analysis
It’s good practice to do some quick exploration of a new dataset:
## 'data.frame': 10 obs. of 5 variables:
## $ Species : chr "Species_A" "Species_B" "Species_C" "Species_D" ...
## $ HyphalExtensionRate: num 10 9 12 11 2 4 3 5 6 7
## $ EnzymeActivity : num 2 4 3 2 9 8 7 9 4 5
## $ SporeProduction : num 900 1100 950 1200 80 150 120 90 1000 800
## $ DroughtTolerance : num 25 15 30 20 85 75 80 90 80 70
## HyphalExtensionRate EnzymeActivity SporeProduction DroughtTolerance
## Min. : 2.00 Min. :2.00 Min. : 80.0 Min. :15.00
## 1st Qu.: 4.25 1st Qu.:3.25 1st Qu.: 127.5 1st Qu.:26.25
## Median : 6.50 Median :4.50 Median : 850.0 Median :72.50
## Mean : 6.90 Mean :5.30 Mean : 639.0 Mean :57.00
## 3rd Qu.: 9.75 3rd Qu.:7.75 3rd Qu.: 987.5 3rd Qu.:80.00
## Max. :12.00 Max. :9.00 Max. :1200.0 Max. :90.00
We exclude the first column (Species) in summary because it’s not numeric. The summary will show basic stats (min, median, mean, max) for each trait.
You can already see differences in scale: e.g. SporeProduction numbers are much larger (hundreds) compared to the others. HyphalExtensionRate ranges roughly 2–12, EnzymeActivity 2–9, DroughtTolerance 15–90 (%). Because of these scale differences, we will need to normalize the traits before PCA and distance calculations, or else SporeProduction (with values ~1000 vs ~100) would dominate the variance.
2. Data preparation: Scaling the Traits
For PCA (and clustering based on Euclidean distance), it’s crucial to scale the traits so that they are comparable. Typically we scale each trait to mean 0 and standard deviation 1 (also known as z-scores). This way, a difference of, say, 50 in spore count (which might be small relative to its range) is on the same footing as a difference of 5°C in a growth temperature trait (if we had one, for example). In R, we can scale easily:
# Scale the trait variables (exclude Species column)
trait_data_numeric <- trait_data[, -1] # drop Species column, keep numeric traits
trait_data_scaled <- scale(trait_data_numeric) # scale() centers and scales columns by default
# Check that scaling worked:
colMeans(trait_data_scaled) # should be ~0 for each trait## HyphalExtensionRate EnzymeActivity SporeProduction DroughtTolerance
## -6.800116e-17 -2.775558e-18 1.110223e-17 2.775558e-17
## HyphalExtensionRate EnzymeActivity SporeProduction DroughtTolerance
## 1 1 1 1
We create trait_data_numeric as just the numeric columns for convenience. After scaling, it’s wise to verify the result – the means should be essentially 0 and SDs 1 (due to rounding, you might see extremely small scientific notation values like 1e-16 instead of 0). Now our data is ready for PCA and clustering.
3. Principal Component Analysis in R
We will use the base R function prcomp() for PCA. Since
we already scaled the data, we can tell prcomp not to scale
again by setting scale. = FALS:
# Perform PCA on the scaled trait data
pca_result <- prcomp(trait_data_scaled, center = FALSE, scale. = FALSE)
# If we hadn't scaled manually, we would do: prcomp(trait_data_numeric, center=TRUE, scale.=TRUE)
# Print PCA summary
summary(pca_result)## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.8957 0.49515 0.31924 0.24354
## Proportion of Variance 0.8984 0.06129 0.02548 0.01483
## Cumulative Proportion 0.8984 0.95969 0.98517 1.00000
The summary(pca_result) will show the Proportion of Variance explained by each principal component (PC). Because we have 4 original traits, we will get 4 PCs. We expect the first one or two to explain most of the variance if our traits are highly correlated (which they are by construction).
We can also examine the PCA loadings and scores:
Loadings (also called rotation in
prcompoutput) tell us how each trait contributes to each principal component. These are found inpca_result$rotation. Essentially, it’s a matrix where columns are PCs and rows are our original traits, and values are the coefficients (usually between -1 and 1) for the linear combination. For example, if PC1 has a high positive loading for EnzymeActivity and DroughtTolerance, and a negative loading for HyphalExtensionRate, that means PC1 almost can be seen as a “high enzyme & tolerance vs. low growth” axis.Scores (coordinates of each species on the PCs) are in
pca_result$x. Each row corresponds to a species, and columns are PC1, PC2, etc. These tell us where each species lies in the new reduced-dimensional space.
## PC1 PC2 PC3 PC4
## HyphalExtensionRate -0.5052261 0.2290401 -0.82217833 0.12771068
## EnzymeActivity 0.5091733 0.3596199 -0.09208987 0.77648925
## SporeProduction -0.4990295 -0.5583885 0.24664158 0.61509327
## DroughtTolerance 0.4862701 -0.7116294 -0.50468821 -0.04913958
# PCA scores (species coordinates on PCs):
pca_scores <- pca_result$x
head(pca_scores, 10) # print scores for all 10 species (or use head if more cases)## PC1 PC2 PC3 PC4
## [1,] -1.85152529 0.21066098 0.04681947 -0.422888781
## [2,] -1.70939493 0.40182028 0.48772972 0.384027949
## [3,] -1.93021473 0.29632309 -0.51606880 -0.009540253
## [4,] -2.39671543 0.03561297 0.05162958 0.016254420
## [5,] 2.44079157 0.17239291 0.27460225 0.084470346
## [6,] 1.73052404 0.32400947 0.03834687 -0.016198171
## [7,] 1.80267117 0.04613242 0.20932698 -0.382714485
## [8,] 2.07441469 0.24089645 -0.51225096 0.199672049
## [9,] -0.12683590 -1.19864032 0.06451442 0.037322297
## [10,] -0.03371518 -0.52920826 -0.14464953 0.109594628
Interpreting the loadings is key to understanding the PCs. Look at the signs and magnitudes for PC1 and PC2 especially.
Visualise PCA plot
Next, let’s visualize the PCA results. A common
approach is a biplot or plotting PC1 vs PC2. We can use the base R
biplot function, which plots species scores and trait
loadings together:
By default this will plot PC1 on the x-axis and PC2 on the y-axis. The points (usually labeled by rownames or numbers) represent the species in PCA space, and the arrows represent the traits (loadings). Because we provided no explicit labels, species might be labeled as numbers 1–10 corresponding to the row order. To improve this, we can add our species names as labels:
(If you prefer ggplot2, one can also make PCA plots by extracting pca_scores and plotting PC1 vs PC2 with species labels. Another quick method is autoplot(pca_result) from the ggfortify package)
ggfortify lets ggplot2 know how to
interpret PCA objects. After loading ggfortify, you can use
ggplot2::autoplot function for stats::prcomp
and stats::princomp objects.
library(ggplot2)
library(plotly)
library(ggfortify)
p <- autoplot(pca_result) #basic plot
p <- autoplot(pca_result, data = trait_data, colour = "Species") #color points based on metadata
p <- autoplot(pca_result, data = trait_data, colour = "Species",
loadings = TRUE, loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 3) #display eigenvectors and attach eigenvector labels
ggplotly(p) #basic plot4. Hierarchical Clustering
Now that we’ve explored the continuous trait space, let’s perform clustering to explicitly group the species.
We need to decide on a distance metric and a
linkage method. For simplicity, we’ll use
Euclidean distance on the scaled trait data (this is
the straight-line distance in 4-dimensional trait space between
species). Since our PCA was also based on Euclidean variance, this is
consistent. For linkage, we use the defult method complete
(uses the furthest distance between points in two clusters) here.
# Compute Euclidean distance on scaled data
dist_matrix <- dist(trait_data_scaled, method = "euclidean")
# Perform hierarchical clustering
hc <- hclust(dist_matrix)
# Plot the dendrogram
plot(hc, labels = trait_data$Species, main="Hierarchical Clustering of Fungi")To make concrete clusters, we can cut the dendrogram at a certain number of clusters. Let’s say we suspect 3 clusters (fast group, slow group, intermediate group):
## [1] 1 1 1 1 2 2 2 2 3 3
This will output a numeric cluster label (1,2,3) for each species in order. You can also table it against species:
# Cut tree into 3 clusters
# See which species are in which cluster
data.frame(Species = trait_data$Species, Cluster = cluster_membership)## Species Cluster
## 1 Species_A 1
## 2 Species_B 1
## 3 Species_C 1
## 4 Species_D 1
## 5 Species_E 2
## 6 Species_F 2
## 7 Species_G 2
## 8 Species_H 2
## 9 Species_I 3
## 10 Species_J 3